Web graph compression with fast access

Author

  • Filip Proborszcz
Abstract

In recent years, studying the content of the World Wide Web has become a very important yet rather difficult task. Search engines need to explore the Internet and index its whole structure while remaining profitable, i.e. keeping machine resource consumption, and therefore cost, low. To support this task, a dedicated structure was introduced: the web graph. Since the Internet consists of billions of pages, it is hard to fit the graph into RAM for fast processing. There is therefore a need for a compression technique that allows a web graph representation to be held in memory while keeping random access time competitive with the time needed to access an uncompressed web graph on a hard drive. Techniques that accomplish this already exist, but there is still room for improvement, and this thesis attempts to demonstrate it. It compares two state-of-the-art methods (BV and k²-partitioned) with two previously implemented algorithms, LM and 2D (rewritten, however, in C++ to maximize speed and resource management efficiency), and introduces a new variant of the latter, called 2D stripes. The thesis also serves as a proof of concept. The final considerations show the positive and negative aspects of all presented methods, assess the feasibility of the new variant, and indicate future directions for development.

The main goal of this thesis is to elaborate on techniques for compressing web graphs that allow fast access to the successor list of a node, and to show the positive and negative aspects of the proposed methods and of other known ones, as well as their practical applications. In this work three methods were evaluated and tested: LM (for list merge), 2D-stripes and 2D-nostripes. They were then compared to BV [1], the most popular compressed web graph representation available, and k²-partitioned [2], which is also very promising. This thesis makes the following contributions:

1. Investigation of currently available tools and algorithms for compressing web graphs,
2. An attempt to port the LM method, currently available in Java, to C++ in order to compare its efficiency with other C++ solutions,
3. Elaboration of the features required of the new solution,
4. Analysis of the possible advantages and drawbacks of each idea and choice of the most suitable algorithms for …
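To make "fast access to the successor list of a node" concrete, here is a minimal C++ sketch of one generic way to store adjacency lists compactly while still decoding any single node's successors directly. It is not the LM, 2D or 2D stripes method from the thesis, nor BV or k²-partitioned. It only illustrates the common idea of gap-encoding sorted successor lists with variable-byte codes and keeping one byte offset per node. The class name `GapEncodedGraph` and all member names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration only: gap + variable-byte coding of sorted
// successor lists, with one byte offset per node for random access.
class GapEncodedGraph {
public:
    // Assumption: adjacency[v] is sorted in strictly increasing order.
    explicit GapEncodedGraph(
        const std::vector<std::vector<std::uint32_t>>& adjacency) {
        offsets_.reserve(adjacency.size() + 1);
        for (const auto& successors : adjacency) {
            offsets_.push_back(bytes_.size());  // where this node's list starts
            put(static_cast<std::uint32_t>(successors.size()));
            std::uint32_t previous = 0;
            for (std::size_t i = 0; i < successors.size(); ++i) {
                assert(i == 0 || successors[i] > previous);
                // The first successor is stored as-is, later ones as gaps.
                put(i == 0 ? successors[i] : successors[i] - previous);
                previous = successors[i];
            }
        }
        offsets_.push_back(bytes_.size());  // end marker
    }

    // Decodes only the requested node's list; other lists are not touched.
    std::vector<std::uint32_t> successors(std::size_t node) const {
        std::size_t pos = offsets_[node];
        const std::uint32_t count = get(pos);
        std::vector<std::uint32_t> result;
        result.reserve(count);
        std::uint32_t value = 0;
        for (std::uint32_t i = 0; i < count; ++i) {
            value += get(pos);  // running sum turns gaps back into node ids
            result.push_back(value);
        }
        return result;
    }

    std::size_t encoded_bytes() const { return bytes_.size(); }

private:
    // Variable-byte code: 7 payload bits per byte, high bit = "more follows".
    void put(std::uint32_t x) {
        while (x >= 0x80) {
            bytes_.push_back(static_cast<std::uint8_t>(x | 0x80));
            x >>= 7;
        }
        bytes_.push_back(static_cast<std::uint8_t>(x));
    }

    std::uint32_t get(std::size_t& pos) const {
        std::uint32_t x = 0;
        int shift = 0;
        while (bytes_[pos] & 0x80) {
            x |= static_cast<std::uint32_t>(bytes_[pos++] & 0x7F) << shift;
            shift += 7;
        }
        x |= static_cast<std::uint32_t>(bytes_[pos++]) << shift;
        return x;
    }

    std::vector<std::uint8_t> bytes_;   // all encoded lists, concatenated
    std::vector<std::size_t> offsets_;  // offsets_[v] = start of node v's list
};
```

Constructing a `GapEncodedGraph` from a vector of sorted successor lists and calling `successors(v)` returns the original list for node `v`. The offset array trades some space for direct access; a real web graph compressor would typically use a much more compact index and exploit similarity between lists.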

Similar resources

The Link Database: Fast Access to Graphs of the Web

The Connectivity Server is a special-purpose database whose schema models the Web as a graph where URLs are nodes and hyperlinks are directed edges. The Link Database provides fast access to the hyperlinks. To support a wide range of graph algorithms, we find it important to fit the Link Database into memory. In the first version of the Link Database, we achieved this fit by using machine...

A model for specification, composition and verification of access control policies and its application to web services

Despite significant advances in the access control domain, the requirements of new computational environments such as web services still raise new challenges. The lack of an appropriate method for the specification, composition, verification and analysis of access control policies (ACPs) has made access control in the composition of web services a complicated problem. In this paper, a new indepe...

Tight and Simple Web Graph Compression

Analysing Web graphs has applications in determining page ranks, fighting Web spam, detecting communities and mirror sites, and more. This study is, however, hampered by the necessity of storing a major part of huge graphs in external memory, which prevents efficient random access to edge (hyperlink) lists. A number of algorithms involving compression techniques have thus been presented, to re...

Delta-K²-tree for Compact Representation of Web Graphs

The structure of the World Wide Web can be represented by a directed graph called the web graph. Web graphs have been used in a wide range of applications. However, increasingly large-scale web graphs pose great challenges to traditional memory-resident graph algorithms. In the literature, the K²-tree can efficiently compress web graphs while supporting fast querying in the compressed dat...

Graph Compression by BFS

The Web Graph is a large-scale graph that does not fit in main memory, so lossless compression methods have been proposed for it. This paper introduces a compression scheme that combines efficient storage with fast retrieval of the information in a node. The scheme exploits the properties of the Web Graph without assuming an ordering of the URLs, so that it may be applied to more general ...

Journal:
  • CoRR

Volume: abs/1304.7355

Publication date: 2012